Multiplicity in Clinical Trials

Introduction

  • The Type I error rate is inflated when $m$ hypothesis tests are conducted, each at the nominal significance level $\alpha$ (e.g. 0.05): this is the multiplicity problem
  • Sources of multiplicity in clinical trials
    • multiple arms
    • more than one endpoint
    • more than one population (e.g. subgroups)
    • repeated testing over time (e.g. interim analyses)
    • etc.
  • Dealing with multiplicity
    • Reducing the degree of multiplicity
      • limit the number of questions
      • minimize the number of variables, e.g. by using composite endpoints or summary statistics
      • prioritize questions
    • If multiplicity still persists
      • multiplicity adjustment (refer to regulatory guidance)
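To make the inflation in the first bullet concrete, here is a minimal Python sketch of the FWER for $m$ tests of true null hypotheses, each run at level $\alpha$, under the simplifying assumption that the tests are independent: $1-(1-\alpha)^m$. With $m = 10$ and $\alpha = 0.05$ this is already about 0.40. The function name `fwer_independent` is hypothetical.

```python
def fwer_independent(m: int, alpha: float = 0.05) -> float:
    """P(at least one false rejection) for m independent tests of true nulls,
    each at the unadjusted level alpha (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 2, 5, 10, 20):
    print(f"m = {m:2d}: FWER = {fwer_independent(m):.3f}")
```

Under positive dependence the inflation is smaller, but it is still present.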

Common multiple test procedures

Basic concepts

  • Family-wise error rate (FWER): overall type I error rate when testing a family of null hypotheses

    • aim: $Pr(\text{reject at least one true null}) \le \alpha$
  • Adjusted p-values: extend ordinary (i.e. unadjusted) p-values by adjusting them for a given multiple test procedure, so that they can be compared directly with the significance level $\alpha$ while controlling the FWER

    • Formally, the adjusted p-value is the smallest significance level at which a given hypothesis is rejected by the multiple test procedure; e.g. for the Bonferroni method, $q_i = \min(mp_i, 1)$
  • Single step methods

    • The rejection or non-rejection of a single hypothesis does not depend on the decision on any other hypothesis.
    • e.g. Bonferroni, Simes, Dunnett, etc.
  • Stepwise methods

    • The rejection or non-rejection of a particular hypothesis may depend on the decision on other hypotheses.
    • e.g. Holm, Hochberg, stepdown Dunnett, …

Methods

Bonferroni

  • Use $\alpha/m$ for all inferences; for $i=1,\ldots,m$: $$\text{Reject } H_i \text{ if } p_i \le \alpha/m$$ or, with adjusted p-values $q_i = \min(mp_i, 1)$, $$\text{Reject } H_i \text{ if } q_i \le \alpha$$
  • This method follows from Boole's inequality: $Pr(\cup_i A_i)\le \sum_i Pr(A_i)$, where $A_i = \{p_i\le \alpha/m\}$ denotes the event of rejecting $H_i$; summing over the (at most $m$) true null hypotheses, each with $Pr(A_i) \le \alpha/m$, gives FWER $\le \alpha$
  • Properties
    • Conservative if the number of hypotheses is large or the test statistics are strongly positively correlated
    • Can be improved by using stepwise methods (e.g. Holm procedure) and accounting for correlations (e.g. Dunnett test)
    • Rarely used in practice, but it is the basis for commonly used advanced procedures
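A minimal sketch of the single-step Bonferroni rule above, returning the adjusted p-values $q_i = \min(mp_i, 1)$ and the rejection decisions. The function name `bonferroni` is hypothetical; p-values are assumed to be given as a plain list.

```python
def bonferroni(pvalues, alpha=0.05):
    """Single-step Bonferroni: q_i = min(m * p_i, 1); reject H_i if q_i <= alpha."""
    m = len(pvalues)
    adjusted = [min(m * p, 1.0) for p in pvalues]
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected


adjusted, rejected = bonferroni([0.01, 0.02, 0.04])
print(adjusted, rejected)
```

With $m = 3$ only the first hypothesis is rejected here (adjusted p-values 0.03, 0.06, 0.12).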

Holm

  • Overview
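    • Using ordinary p-values: order the p-values $p_{(1)} \le \cdots \le p_{(m)}$ with corresponding hypotheses $H_{(1)}, \ldots, H_{(m)}$; test $H_{(1)}$ at $\alpha/m$, $H_{(2)}$ at $\alpha/(m-1)$, $\ldots$, $H_{(m)}$ at $\alpha$, i.e. reject $H_{(i)}$ if $p_{(j)} \le \alpha/(m-j+1)$ for all $j = 1, \ldots, i$, and stop at the first non-rejection
    • Using adjusted p-values: $q_{(i)} = \max_{j \le i} \min\{(m-j+1)\,p_{(j)}, 1\}$; reject $H_{(i)}$ if $q_{(i)} \le \alpha$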
  • Properties
    • A stepwise procedure that is uniformly more powerful than the Bonferroni method: it rejects every hypothesis that Bonferroni rejects, and possibly more
    • Sometimes called "stepdown Bonferroni" procedure
    • Can be improved by accounting for correlations (e.g. stepdown Dunnett test)
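The step-down rule can be sketched via the adjusted p-values $q_{(i)} = \max_{j \le i} \min\{(m-j+1)p_{(j)}, 1\}$. This is an illustrative implementation, not taken from any specific package; the function name `holm` is hypothetical.

```python
def holm(pvalues, alpha=0.05):
    """Holm step-down procedure via adjusted p-values.

    For the hypothesis with the j-th smallest p-value (1-based j):
    q_(j) = max_{k <= j} min((m - k + 1) * p_(k), 1).
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p-value first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank = 0, ..., m - 1, so m - rank = m - j + 1
        running_max = max(running_max, min((m - rank) * pvalues[i], 1.0))
        adjusted[i] = running_max
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected


print(holm([0.01, 0.02, 0.04]))
```

For the p-values 0.01, 0.02, 0.04 the adjusted values are 0.03, 0.04, 0.04, so all three hypotheses are rejected, whereas single-step Bonferroni rejects only the first.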

Simes

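  • Overview: the Simes test is a test of the global intersection hypothesis $H = \cap_{i=1}^m H_i$; with ordered p-values $p_{(1)} \le \cdots \le p_{(m)}$, $$\text{Reject } H \text{ if } p_{(i)} \le i\alpha/m \text{ for at least one } i = 1, \ldots, m$$

  • Comparison with Bonferroni: the Bonferroni global test rejects $H$ if $p_{(1)} \le \alpha/m$

    • Simes is more powerful than a global test based on Bonferroni: its condition for $i = 1$ is exactly the Bonferroni condition, so Simes rejects $H$ whenever Bonferroni does
    • Simes controls the type I error rate only under independence or certain forms of positive dependence between the p-values; Bonferroni makes no assumption about the dependence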
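A short sketch comparing the two global tests through their p-values: Simes uses $\min_i m\,p_{(i)}/i$ and Bonferroni uses $m\,p_{(1)}$, both capped at 1. The function names are hypothetical. Since the $i = 1$ term of the Simes minimum equals the Bonferroni value, the Simes p-value can never be larger.

```python
def simes_global_pvalue(pvalues):
    """Simes p-value for the intersection hypothesis: min_i m * p_(i) / i."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    return min(1.0, min(m * p / (i + 1) for i, p in enumerate(ordered)))


def bonferroni_global_pvalue(pvalues):
    """Bonferroni p-value for the intersection hypothesis: min(m * p_(1), 1)."""
    return min(1.0, len(pvalues) * min(pvalues))


p = [0.03, 0.04]
print(simes_global_pvalue(p), bonferroni_global_pvalue(p))
```

For $p = (0.03, 0.04)$ the Simes p-value is 0.04 (global null rejected at 0.05), while the Bonferroni p-value is 0.06 (not rejected).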

Hochberg (stepwise version of Simes method/stepup Simes)

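  • Overview: with ordered p-values $p_{(1)} \le \cdots \le p_{(m)}$, start with the largest one: if $p_{(m)} \le \alpha$, reject all hypotheses; otherwise compare $p_{(m-1)}$ with $\alpha/2$, and so on. In general, find the largest $i$ such that $p_{(i)} \le \alpha/(m-i+1)$ and reject $H_{(1)}, \ldots, H_{(i)}$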
  • Properties
    • Stepup Simes
    • More powerful than the Holm procedure
      • Both use the same thresholds, but Hochberg starts with the largest p-value (step-up), whereas Holm starts with the smallest (step-down)
    • It makes the same assumption as the Simes test, i.e. independence or positive dependence of the p-values
    • Can be improved, e.g. by the Hommel procedure, which is based on the closed test procedure
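A minimal sketch of the step-up rule via adjusted p-values, $q_{(i)} = \min_{j \ge i} \min\{(m-j+1)p_{(j)}, 1\}$; the thresholds are the same as Holm's, but the minimum runs from the largest p-value downwards. The function name `hochberg` is hypothetical.

```python
def hochberg(pvalues, alpha=0.05):
    """Hochberg step-up procedure via adjusted p-values.

    For the hypothesis with the j-th smallest p-value (1-based j):
    q_(j) = min_{k >= j} min((m - k + 1) * p_(k), 1).
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # smallest p-value first
    adjusted = [0.0] * m
    running_min = 1.0  # also caps the adjusted p-values at 1
    # walk from the largest p-value (multiplier 1) to the smallest (multiplier m)
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, (m - rank) * pvalues[i])
        adjusted[i] = running_min
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected


print(hochberg([0.03, 0.04]))
```

For $p = (0.03, 0.04)$ both adjusted p-values are 0.04, so both hypotheses are rejected; Holm gives adjusted p-values 0.06 and 0.06 and rejects neither.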

Dunnett

  • Used when comparing several treatments with a common control

  • The other methods above can also be used, but only the Dunnett test exploits the correlation between the test statistics

  • Overview

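    • Linear model and hypotheses: one-way layout $y_{ij} = \mu_i + \varepsilon_{ij}$ with i.i.d. $\varepsilon_{ij} \sim N(0, \sigma^2)$, where group $i = 0$ is the control and $i = 1, \ldots, m$ are the treatments with $n_i$ observations each; one-sided hypotheses $H_i: \mu_i \le \mu_0$ vs. $K_i: \mu_i > \mu_0$
    • Individual test statistics: $t_i = \dfrac{\bar y_i - \bar y_0}{s\sqrt{1/n_i + 1/n_0}}$, where $s^2$ is the pooled variance estimate with $v = \sum_{i=0}^m n_i - (m+1)$ degrees of freedom; under the global null, $(t_1, \ldots, t_m)$ follow an $m$-variate t-distribution with correlations $\rho_{ij} = \sqrt{\dfrac{n_i n_j}{(n_i+n_0)(n_j+n_0)}}$
    • Rejection rule: reject $H_i$ if $t_i \ge c_{m, 1-\alpha}$, where $c_{m,1-\alpha}$ is the $1-\alpha$ quantile of $\max_i t_i$ under this $m$-variate t-distribution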
  • Properties

    • Single step test, which is more powerful than Bonferroni because it exploits the known correlations between the test statistics
    • Adjusted p-values can be calculated numerically based on the multivariate t-distribution
    • The Dunnett test shown here can be extended to general linear models and, asymptotically, to generalized linear models
    • It can be improved by extending it to a stepwise procedure, similar to the Holm procedure
    • Other well-known parametric tests follow the same principle. For example, the Tukey test compares all pairs of treatment groups, also using a multivariate t-distribution
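In practice, $c_{m,1-\alpha}$ is computed by numerical integration of the multivariate t-distribution. As a stand-in, here is a Monte Carlo sketch. It assumes a balanced one-way layout (equal group sizes, so every $\rho_{ij} = 1/2$) and one-sided comparisons, and it uses only the Python standard library. `dunnett_critical_value` is a hypothetical helper, and the simulated value carries Monte Carlo error.

```python
import math
import random


def dunnett_critical_value(m, df, alpha=0.05, n_sim=100_000, seed=1):
    """Monte Carlo estimate of c_{m,1-alpha}, the (1 - alpha) quantile of max_i t_i
    under the global null, for m one-sided treatment-vs-control comparisons in a
    balanced one-way layout (assumption: equal group sizes, rho_ij = 1/2)."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sim):
        z0 = rng.gauss(0.0, 1.0)  # standardized control mean
        # s / sigma, with (s / sigma)^2 ~ chi-square(df) / df
        s = math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)
        # t_i = (Z_i - Z_0) / (sqrt(2) * s / sigma); keep the maximum over treatments
        t_max = max((rng.gauss(0.0, 1.0) - z0) / (math.sqrt(2.0) * s) for _ in range(m))
        maxima.append(t_max)
    maxima.sort()
    return maxima[math.ceil((1.0 - alpha) * n_sim) - 1]


c = dunnett_critical_value(m=3, df=60)
print(f"c_3,0.95 (df = 60) approx. {c:.2f}")
```

A single-step Dunnett test would then reject $H_i$ whenever $t_i \ge c$. For $m = 1$ the simulation reduces to the univariate $t_{v}$ quantile.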

Stepwise Dunnett

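  • Overview: order the test statistics $t_{(1)} \ge t_{(2)} \ge \cdots \ge t_{(m)}$ with hypotheses $H_{(1)}, \ldots, H_{(m)}$; reject $H_{(1)}$ if $t_{(1)} \ge c_{m,1-\alpha}$, then reject $H_{(2)}$ if $t_{(2)} \ge c_{m-1,1-\alpha}$, and so on, stopping at the first non-rejection; $c_{k,1-\alpha}$ is computed from the $k$-variate t-distribution of the statistics of the hypotheses not yet rejected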

  • Properties

    • The quantiles change as hypotheses are rejected; e.g. if $H_{(1)}$ is rejected, then the quantile $c_{m-1, 1-\alpha}$ is computed from an $(m-1)$-variate t-distribution
    • The stepwise Dunnett test is more powerful than the single step Dunnett test
      • It can be shown that $c_{m, 1-\alpha} \ge c_{m-1, 1-\alpha} \ge \cdots \ge c_{1, 1-\alpha}$, where $c_{1, 1-\alpha} = t_{v, 1-\alpha}$ is the quantile of the univariate t-distribution with $v$ degrees of freedom
      • The single step Dunnett test uses $c_{m, 1-\alpha}$ for all comparisons
    • The stepwise Dunnett test is more powerful than the Holm procedure because it exploits the known correlations between the test statistics
      • The stepwise version shown here is sometimes called the "stepdown Dunnett" test
      • A "stepup Dunnett" test also exists, similar to Hochberg
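The step-down logic can be sketched by recomputing the critical value for the $k$ hypotheses still in play at each step. The sketch assumes a balanced one-way layout with one-sided comparisons (so $c_{k,1-\alpha}$ depends only on $k$) and estimates the quantiles by Monte Carlo rather than by numerical integration. Both function names are hypothetical.

```python
import math
import random


def max_t_quantile(k, df, alpha=0.05, n_sim=40_000, seed=1):
    """Monte Carlo c_{k,1-alpha}: (1 - alpha) quantile of the max of k one-sided
    many-to-one t statistics (balanced layout assumed, correlations 1/2)."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sim):
        z0 = rng.gauss(0.0, 1.0)
        s = math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)
        maxima.append(max(rng.gauss(0.0, 1.0) - z0 for _ in range(k)) / (math.sqrt(2.0) * s))
    maxima.sort()
    return maxima[math.ceil((1.0 - alpha) * n_sim) - 1]


def stepdown_dunnett(t_stats, df, alpha=0.05):
    """Return which H_i are rejected by the step-down Dunnett test."""
    m = len(t_stats)
    order = sorted(range(m), key=lambda i: t_stats[i], reverse=True)  # largest t first
    rejected = [False] * m
    for step, i in enumerate(order):
        k = m - step  # number of hypotheses not yet rejected
        if t_stats[i] >= max_t_quantile(k, df, alpha):
            rejected[i] = True
        else:
            break  # stop at the first non-rejection
    return rejected


print(stepdown_dunnett([3.0, 2.2, 0.5], df=60))
```

Because $c_{k,1-\alpha}$ shrinks as $k$ decreases, later steps use smaller critical values than the single step test.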

Summary


  • Stepwise methods are preferred over single step methods, which are less powerful and less used in practice
  • Accounting for correlations leads to more powerful procedures, but correlations are not always known
  • Simes-based methods are more powerful than Bonferroni-based methods, but control the FWER only under certain dependence structures
  • In practice, we select a procedure that is not only powerful from a statistical perspective, but also appropriate from a clinical perspective

Hierarchical test procedure

Background

  • The multiple test methods above do not reflect the relative importance of the hypotheses (e.g. endpoints), which is usually the case in RCTs, where primary, secondary and exploratory endpoints have ordered importance
  • The stepwise procedures above use a data-driven order of hypotheses (determined by the observed p-values), whereas in the RCT setting we need a multiple test procedure that specifies the order of the hypotheses based on clinical importance
  • Hierarchical test procedure: the hierarchy of hypotheses is specified before the data are observed

Fixed sequence procedure

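  • Overview: test the hypotheses in a pre-specified order $H_1, H_2, \ldots, H_m$, each at the full level $\alpha$; $H_i$ is rejected if $p_1 \le \alpha, \ldots, p_i \le \alpha$, and testing stops at the first hypothesis that is not rejected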

  • Properties

    • Adjusted p-values are given by $q_i = \max\{p_1, \cdots, p_i\}, i = 1, \cdots, m$
    • Advantages
      • Simple
      • Optimal when the hypotheses early in the sequence are associated with large effects
    • Disadvantages
      • Performs poorly when the early hypotheses are associated with small effects
      • Once a hypothesis is not rejected, no further testing is permitted
    • Great care is advised when specifying the sequence of hypotheses
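A minimal sketch of the fixed sequence procedure using the adjusted p-values $q_i = \max\{p_1, \ldots, p_i\}$. The p-values are assumed to be listed in the pre-specified order; the function name `fixed_sequence` is hypothetical.

```python
def fixed_sequence(pvalues, alpha=0.05):
    """Fixed sequence procedure; pvalues must be in the pre-specified testing order.
    Adjusted p-values: q_i = max(p_1, ..., p_i); reject H_i if q_i <= alpha."""
    adjusted = []
    running_max = 0.0
    for p in pvalues:
        running_max = max(running_max, p)
        adjusted.append(running_max)
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected


print(fixed_sequence([0.01, 0.04, 0.20, 0.001]))
```

The fourth hypothesis is not rejected despite $p_4 = 0.001$ because testing stops at $H_3$. This illustrates why the ordering must be chosen carefully.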

Fallback procedure

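  • Overview: split $\alpha = \alpha_1 + \cdots + \alpha_m$ and test the hypotheses in the pre-specified order: test $H_1$ at level $\alpha_1' = \alpha_1$; for $i \ge 2$, test $H_i$ at level $\alpha_i' = \alpha_i + \alpha_{i-1}'$ if $H_{i-1}$ was rejected and at $\alpha_i' = \alpha_i$ otherwise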

  • Properties

    • The fixed sequence procedure is obtained as a special case from the fallback procedure by setting $\alpha_1=\alpha$ and $\alpha_i=0$ for $i>1$
    • In contrast to the fixed sequence procedure, the fallback procedure tests all hypotheses in the pre-specified sequence, even if the initial hypotheses are not rejected
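A minimal sketch of the fallback rule: the level of a rejected hypothesis is carried forward to the next one, while a non-rejection resets the carried level to 0. The p-values and the split $(\alpha_1, \ldots, \alpha_m)$ are assumed to be in the pre-specified order, and `fallback` is a hypothetical name.

```python
def fallback(pvalues, alphas):
    """Fallback procedure. alphas = (alpha_1, ..., alpha_m) sums to the overall alpha.
    H_1 is tested at alpha_1; H_i at alpha_i plus the level used for H_{i-1}
    if H_{i-1} was rejected, otherwise at alpha_i alone."""
    rejected = []
    carried = 0.0
    for p, a in zip(pvalues, alphas):
        level = a + carried
        reject = p <= level
        rejected.append(reject)
        carried = level if reject else 0.0  # pass the level on only after a rejection
    return rejected


print(fallback([0.01, 0.20, 0.01], [0.025, 0.0125, 0.0125]))
```

Here $H_3$ is still rejected at its own level 0.0125 after $H_2$ fails, which the fixed sequence procedure would not allow. Setting the split to $(\alpha, 0, \ldots, 0)$ recovers the fixed sequence procedure.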

Closed test procedure (CTP)

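  • Overview/formal definition: for the family $H_1, \ldots, H_m$, consider all non-empty intersection hypotheses $H_I = \cap_{i \in I} H_i$, $I \subseteq \{1, \ldots, m\}$
    • Test each intersection hypothesis $H_I$ with a suitable local test (Bonferroni, Simes, Dunnett, etc.) at level $\alpha$
    • Reject an individual hypothesis $H_i$ if all intersection hypotheses $H_I$ with $i \in I$ (including $H_i$ itself) are rejected at local level $\alpha$; the resulting procedure controls the FWER in the strong sense at level $\alpha$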
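A brute-force sketch of the closed test procedure with a pluggable local test, written as a function that maps the p-values of the hypotheses in $I$ to a p-value for $H_I$. It enumerates all $2^m - 1$ intersections, so it is only practical for small $m$. With Bonferroni local tests it should reproduce the Holm decisions, and with Simes local tests the Hommel decisions. All function names are hypothetical.

```python
from itertools import combinations


def closed_test(pvalues, local_pvalue, alpha=0.05):
    """Reject H_i iff every intersection H_I with i in I is rejected by the local
    level-alpha test; local_pvalue(list of p-values in I) -> p-value for H_I."""
    m = len(pvalues)
    local_reject = {}
    for size in range(1, m + 1):
        for I in combinations(range(m), size):
            local_reject[I] = local_pvalue([pvalues[i] for i in I]) <= alpha
    return [all(rej for I, rej in local_reject.items() if i in I) for i in range(m)]


def bonferroni_local(ps):
    """Bonferroni test of an intersection of len(ps) hypotheses."""
    return min(1.0, len(ps) * min(ps))


def simes_local(ps):
    """Simes test of an intersection of len(ps) hypotheses."""
    k = len(ps)
    return min(1.0, min(k * p / (j + 1) for j, p in enumerate(sorted(ps))))


print(closed_test([0.01, 0.02, 0.04], bonferroni_local))
print(closed_test([0.03, 0.04], simes_local))
```

For $p = (0.03, 0.04)$ the Simes-based closure rejects both hypotheses, while the Bonferroni-based closure rejects neither.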

CTP using Bonferroni (= Holm procedure)


CTP using Simes

  • When $m=2$, it is equivalent to the Hochberg procedure
  • When $m>2$, the Hochberg procedure is less powerful than the CTP using Simes (known as the Hommel procedure)

CTP using Dunnett

  • This is equivalent to the stepdown Dunnett procedure

CTP using weighted Bonferroni

  • The first version, which tests $H_1 \cap H_2$ with weights $(1, 0)$ (i.e. rejects it if $p_1 \le \alpha$), is equivalent to the fixed sequence procedure

  • The second version is equivalent to the fallback procedure

What if more than two hypotheses?

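  • Apply the CTP to all $2^m - 1$ non-empty intersection hypotheses; e.g. for $m = 3$: $H_1 \cap H_2 \cap H_3$, the pairwise intersections $H_1 \cap H_2$, $H_1 \cap H_3$, $H_2 \cap H_3$, and the elementary hypotheses $H_1, H_2, H_3$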

Summary


Summary and Conclusions

  • Closed test procedure is a general principle to construct powerful multiple test procedures; many common procedures are CTPs
  • For structured hypotheses, one can apply the graphical approach, which is based on CTPs
  • It is critical to choose the suitable method for a particular problem
  • There are different types of multiplicity problems that need other methods than those described here, such as:
    • Safety data analyses
    • Large-scale testing in genetics, proteomics etc.
    • Post-hoc analyses / data snooping

Graphical approach[1][2]

  • Initial allocation of the significance level to $m$ hypotheses: $\alpha_1 + \cdots + \alpha_m = \alpha$
  • $\alpha$-propagation: if a hypothesis $H_i$ is rejected at level $\alpha_i$, propagate its level $\alpha_i$ to the remaining, not yet rejected hypotheses (according to a pre-specified rule) and continue testing with the updated $\alpha$ levels

Conventions

Weighted Holm procedure, i.e. $\alpha$ is no longer split evenly among the hypotheses


Common multiple test procedures

  • Fixed sequence procedure
  • Fallback procedure

Formal description

  • Initial levels $\alpha = (\alpha_1, \cdots, \alpha_m)$ with $\sum_{i=1}^m\alpha_i = \alpha \in (0, 1)$

  • $m \times m$ transition matrix $\bf{G}=(g_{ij})$, where $g_{ij}$ is the fraction of the level of $H_i$ that is propagated to $H_j$, with $0\le g_{ij} \le 1$, $g_{ii} = 0$ and $\sum_{j=1}^m g_{ij}\le1$ for all $i=1, \cdots, m$

    ($\bf{G}, \alpha$) determine a graph and its associated multiple test procedure

  • Update algorithm

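    1. Select a $j$ with $p_j \le \alpha_j$, reject $H_j$ and remove it from the index set $I$ of remaining hypotheses; if no such $j$ exists, stop
    2. Update the levels: $\alpha_l \leftarrow \alpha_l + \alpha_j g_{jl}$ for $l \in I$
    3. Update the graph: for $l, k \in I$, $l \ne k$, $g_{lk} \leftarrow \dfrac{g_{lk} + g_{lj}g_{jk}}{1 - g_{lj}g_{jl}}$ if $g_{lj}g_{jl} < 1$, and $g_{lk} \leftarrow 0$ otherwise
    4. Repeat from step 1 with the updated levels and graph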

  • The initial levels $\alpha$, the transition matrix $\bf{G}$, and the algorithm define a unique sequentially rejective test procedure that controls the FWER at level $\alpha$

  • Any multiple test procedure derived and visualized by a graph ($G, \alpha$) is based on the closed test principle

  • The graph ($G, \alpha$) and the algorithm define weighted Bonferroni tests for each intersection hypothesis in a CTP

  • The algorithm defines a shortcut for the resulting CTP, which does not depend on the rejection sequence

  • Tools: R package gMCP
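The update algorithm can be sketched in a few lines of plain Python. This is an illustrative implementation written from the algorithm steps, not the gMCP package's code, and the function name `graphical_test` is hypothetical. Since the final rejections do not depend on the rejection sequence, the sketch simply picks the smallest eligible index at each step.

```python
def graphical_test(pvalues, alphas, G):
    """Sequentially rejective graphical procedure.

    alphas: initial local levels (alpha_1, ..., alpha_m) summing to alpha.
    G: m x m transition matrix (list of lists) with g_ii = 0 and row sums <= 1.
    """
    m = len(pvalues)
    a = list(alphas)
    g = [list(row) for row in G]
    active = set(range(m))
    rejected = [False] * m
    while True:
        candidates = [j for j in active if pvalues[j] <= a[j]]
        if not candidates:
            return rejected
        j = min(candidates)  # any eligible j gives the same final result
        rejected[j] = True
        active.remove(j)
        for l in active:  # propagate the level of H_j
            a[l] += a[j] * g[j][l]
        new_g = [row[:] for row in g]
        for l in active:  # update the transition weights among remaining hypotheses
            for k in active:
                if l == k:
                    new_g[l][k] = 0.0
                elif g[l][j] * g[j][l] < 1.0:
                    new_g[l][k] = (g[l][k] + g[l][j] * g[j][k]) / (1.0 - g[l][j] * g[j][l])
                else:
                    new_g[l][k] = 0.0
        g = new_g
        a[j] = 0.0


# Holm procedure for m = 2: equal split, all level passed to the other hypothesis
print(graphical_test([0.02, 0.04], [0.025, 0.025], [[0, 1], [1, 0]]))
```

With initial levels $(\alpha, 0)$ and $g_{12} = 1$, $g_{21} = 0$ the same function behaves like the fixed sequence procedure. With levels $(\alpha_1, \alpha_2)$ and the same matrix it gives a fallback-type strategy.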

Summary

  • Tailor advanced multiple test procedures to structured families of hypotheses
  • Visualize complex decision strategies in an efficient and easily communicable way
  • Ensure strong FWER control
  • It covers many common multiple test procedures as special cases: Holm, fixed sequence, fallback, gatekeeping, etc.

References


  1. Bretz, Frank and Maurer, Willi and Brannath, Werner and Posch, Martin (2009). A graphical approach to sequentially rejective multiple test procedures. Statistics in Medicine.

  2. Bretz, Frank and Posch, Martin and Glimm, Ekkehard and Klinglmueller, Florian and Maurer, Willi and Rohmeyer, Kornelius (2011). Graphical approaches for multiple comparison procedures using weighted Bonferroni, Simes, or parametric tests. Biometrical Journal.